ABSTRACT

This paper shows that Garfinkel and Gramlich's
conclusion about the average performance of the contractors is not
supported by the data. The discussion focuses on three issues: (1)
The study was conducted in such a way that performance contracting
schools and the comparison schools were generally not comparable in
terms of initial achievement, socioeconomic status, and a host of
other demographic and social variables that may have influenced the
results. Since the contractors were generally assigned to the schools
with the lower achieving students, who also had lower S.E.S. and
family income, the effect of the noncomparability of the comparison
schools was to negatively bias the estimated effect of performance
contracting. (2) The statistical adjustments used by Garfinkel and
Gramlich were inadequate to offset these biases. (3) The data base
was open to a wide variety of potential biases that were not
assessed; e.g., the testing conditions (at some schools) were
terrible, the tests may have been highly speeded and thus not
measuring reading and mathematics ability per se, and there was no
control on the control group in that they may have been trying harder
to outdo the experimentals. (Author/JM)
A Critique of the Report
by Irv Garfinkel and Edward M. Gramlich 

entitled 

A STATISTICAL ANALYSIS OF THE OEO EXPERIMENT IN PERFORMANCE CONTRACTING

Edward F. O'Connor, Jr. 
Educational Testing Service

and 

Stephen P. Klein
University of California, Los Angeles 

The OEO experiment "was designed to test whether private companies
operating with their existing technologies on an incentive basis could provide
better remedial education to poor students than the normal schools" (G & G, p. 1).
It was not expected that all private companies could outperform all public
schools under all circumstances. On the contrary, different companies were
expected to use different methods under different circumstances and to achieve
varying degrees of success.

The central issue in the OEO experiment was whether performance
contracting was worth pursuing, i.e., does it appear likely that some
contractors can significantly out-perform some public schools? There is
considerable evidence in the Garfinkel and Gramlich report that some contractors
did indeed out-perform the public schools at certain sites. Despite this
evidence, Garfinkel and Gramlich concluded that "performance contractors who
participated in the experiment do not currently have the capability of
bringing about any great improvement in the educational status of disadvantaged
children" (p. 27). The basis for their conclusion was that their analysis
indicated that the average performance of the contractors was very similar
to the average performance of the public schools. In other words, they are
implying a conclusion about the individual contractors on the basis of the
group's performance as a whole, averaging the successes and the failures.
Whether contractors as a whole out-performed public schools was essentially
a non-question: it was anticipated that some contractors would fail. The
essence of the performance contract concept is that the performance contractors
who consistently fail will have to change their methods or go out of business
(as many of them have). The major issue was whether there is evidence that
some performance contractors can out-perform some public schools. The OEO
experiment was not adequately designed to answer this question, but there is
considerable evidence in the data that some contractors consistently
out-performed the public schools. We will show in this paper that Garfinkel and
Gramlich's conclusion about the average performance of the contractors is not
supported by the data. This discussion will focus on three issues.

1. The study was conducted in such a way that performance contracting
schools and the comparison schools were generally not comparable
in terms of initial achievement, socioeconomic status, and a
host of other demographic and social variables that may have
influenced the results. Since the contractors were generally
assigned to the schools with the lower achieving students, who
also had lower SES and family income, the effect of the
non-comparability (or mis-matching) of the comparison schools was to
negatively bias the estimated effect of performance contracting.

2. The statistical adjustments used by Garfinkel and Gramlich were
inadequate to offset these biases.

3. The data base was open to a wide variety of potential biases
that were not assessed; e.g., the testing conditions (at some
schools) were terrible, the tests may have been highly speeded
and thus not measuring reading and mathematics ability per se,
and there was no control on the control group in that they may
have been trying harder to outdo the experimentals (called the
"John Henry Effect," Saretsky, 1972).

Inadequacies in the Experimental Design

In an ideal experiment, the random assignment of students to the
experimental and control groups allows the researcher to state with a high
degree of confidence that the posttest differences between the experimental
and control groups are due to something which occurred between the time of the
random assignment and the posttest. But in many educational settings, the
random assignment of students is not possible and alternative strategies must
be employed. One alternative approach is to randomly assign comparable
schools to the experimental and control groups. Campbell and Stanley (1963)
refer to this approach as a "quasi-experimental" design. Another alternative
is to choose the experimental schools and then find comparable "control"
schools. Campbell and Stanley refer to this approach as a "pre-experimental"
design because it has questionable validity under even the best of
circumstances. This latter approach was the one used in the OEO study despite
very apparent and unfavorable circumstances, namely, the fact that the control
schools had more able and affluent students.

Why was the pre-experimental design chosen? The details in this
particular case are probably quite complex and involve a variety of factors,
personalities, and policy decisions. It is interesting to note, however, that
most federal evaluation efforts have put the less able and less affluent in the
experimental group, e.g., Headstart and Title I. Perhaps the decision-makers
feel that the treatment will be beneficial and it should go to the students
who appear to need it and can profit from it most. Unfortunately, this very
method of assignment ensures that the comparison schools and students are not
comparable in a variety of known and unknown ways. Thus, the data are often
"adjusted" statistically in an attempt to control for the non-comparability.
For a variety of technical reasons that will be discussed later in this
report, the initial differences between the treatment groups are under-adjusted.
Since the experimental treatment has generally been assigned to the lower
achieving schools, the consistent result has been a potential under-estimate
of the effect of social action educational programs, such as Headstart, Title
I, and the OEO Performance Contracting Study. It must be reiterated, however,
that these attempts to make statistical adjustments were initiated only
because the researchers failed to use the more powerful research designs
available to them. In the OEO study, for example, a much more powerful
design would have been to randomly assign schools to one of the two
treatments at each site. This would have resulted in essentially the same
initial position of the two groups (experimental and control) for at least
the aggregate sample. Such a design would have essentially eliminated many
of the statistical and logical problems which will be discussed in connection
with OEO's analyses.

In addition to the non-random assignment of schools, there were other
inadequacies in carrying out the experimental design. Different criteria
were used to select the experimental and control schools. "Generally speaking,
the most deficient school or schools were selected as the experimental school(s)
and the next most deficient as the control schools" (Ray, 1972, p. 8).
Within the experimental and control schools, different criteria were sometimes
used to select students for the experimental and control groups. It is
likely that in some cases less care was employed to select the control
students since they were not going to benefit from the experimental program.
Attrition was handled differently in the experimental and control groups.
"Thus, of the 106 site/grade/subject area combinations at the secondary
level, the control group has a higher grade equivalency entry in 84, or
approximately 80 percent of these combinations" (Ray, 1972, p. 34).

Despite the fact that median income and other family background data
were needed as covariates for some of the statistical adjustment procedures,
the overall survey return for the parents' questionnaire was less than fifty
percent, varying from zero percent at some schools to an estimate of over one
hundred percent at a Dallas eighth grade class. "This can be only accounted
for in terms of Dallas eighth grade adding more control students after the initial
master list was created. The situation at Dallas was probably replicated at
other sites; hence response rates should be interpreted cautiously" (Ray,
1972, Appendix B). It must be added that not only the return rate but the data
themselves must be interpreted cautiously.

Racial data was not collected at four of the sites. For ten of the
fourteen sites where racial data was collected, the percentage of whites in the
control group exceeded the percentage of whites in the experimental group
(in one school 37% higher).

Median income was also used in the adjustment methods. The response
rate varied from lows of zero percent for the Rockland control group and
eight percent at the Philadelphia control group to a high of 98 percent for
the Taft control group. Median income varied by better than two to one at
some sites (Anchorage: second grade experimental group, $8,000; second
grade control group, $16,950; McComb: third grade experimental group, $4,375;
third grade control group, $9,500).

Inadequacies in the Statistical Analysis

In most evaluation studies in which matching or statistical adjustments
have been used to control for pre-existing differences in the treatment groups,
there has been no outside criterion which could be used to validate the
matching or adjustment procedures. For example, one can choose to believe
(Cicirelli, 1970; Evans & Schiller, 1970) or not believe (Campbell & Erlebacher,
1970) that the matching procedures were adequate in the Westinghouse-Ohio
evaluation of Headstart. The data needed to validate the procedures simply
were not available. Fortunately in the OEO study, there is an outside criterion
which can be used to examine the validity of the adjustment procedures. This
opportunity came about as a result of the fact that classes within each school
differed greatly in their pretest scores. Consequently some experimental
classes were superior to their control counterparts even though the experimental
school as a whole scored below its control school. This meant that for every
kind of statistical adjustment employed, a test could be run to examine its
potential efficacy. Unfortunately, this is a one-sided test in that it can
only say whether a particular adjustment is biased in a given manner; it cannot
say that the adjustment is unbiased. Furthermore the test is only sensitive to
relatively large biases. The nature of this test is illustrated by the
following table:
This table compares the initial status of the sites with their final
status on the "adjusted" posttest treatment effects. For each grade the
eighteen sites were ranked according to the difference between the experimental
and control group on one initial status variable, either the reading or mathematics
pretest, or median family income, or percent white, with the largest
difference in favor of the experimental group listed first and the largest difference
favoring the control group listed last.* The sites were then split into
thirds: the third most favorable to the experimental group, the middle third,
and the third most favorable to the control group. The same procedure was
followed for each of Garfinkel and Gramlich's five treatment effect estimates
and the Battelle estimate. A three by three contingency table was prepared
with the initial status differences determining the rows and the "adjusted"
posttest differences determining the columns. The tables were prepared by



*There was a small difference in the samples used by Battelle and by
OEO for their analyses. These differences were small enough to be ignored
in this analysis.
summing across all six grades in order to obtain a large enough sample size
to test for significance.** A significant chi-square means that the adjusted
posttest differences are not independent of initial status and hence are
biased. A summary of the chi-squares is presented in Table 2 in the
appendix. The six estimates of the treatment effect for reading are designated
R1 through R6 and the six estimates for math are designated as M1 through M6.
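
The validation check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the site values are invented, the helper names `thirds` and `chi_square_3x3` are hypothetical, and a single eighteen-site table is shown rather than the sum over six grades used in the actual analysis.

```python
from collections import Counter

def thirds(values):
    """Label each site 0, 1, or 2 by rank: 0 = top third (largest values)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    labels = [0] * len(values)
    for rank, site in enumerate(order):
        labels[site] = rank * 3 // len(values)
    return labels

def chi_square_3x3(initial_diff, adjusted_effect):
    """Pearson chi-square for the 3 x 3 table of initial-status thirds (rows)
    against adjusted-effect thirds (columns)."""
    rows = thirds(initial_diff)
    cols = thirds(adjusted_effect)
    counts = Counter(zip(rows, cols))
    row_total = Counter(rows)
    col_total = Counter(cols)
    n = len(rows)
    chi2 = 0.0
    for r in range(3):
        for c in range(3):
            expected = row_total[r] * col_total[c] / n
            chi2 += (counts[(r, c)] - expected) ** 2 / expected
    return chi2

# Critical value of chi-square with (3 - 1) * (3 - 1) = 4 degrees of freedom, alpha = .05.
CRITICAL_05_DF4 = 9.488

# Hypothetical data for 18 sites (not taken from the OEO report).
initial = [18 - i for i in range(18)]                 # experimental-minus-control pretest gap
tracking = list(initial)                              # "adjusted" effect that follows initial status
unrelated = [-(i % 3) * 100 - i for i in range(18)]   # "adjusted" effect unrelated to initial status

for name, effect in [("tracking", tracking), ("unrelated", unrelated)]:
    chi2 = chi_square_3x3(initial, effect)
    print(name, round(chi2, 2), "biased" if chi2 > CRITICAL_05_DF4 else "no bias detected")
```

A chi-square above the critical value indicates that the adjusted estimates still depend on initial status. A value below it shows only that no bias was detected, which is the one-sided character of the test noted earlier.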

The first set of chi-squares compares each of the twelve treatment effect
estimates with initial status on the pretest of the same subject. All of
these chi-squares were non-significant. The second set compares each of the
twelve estimates with the opposite pretest. Four of these chi-squares are
significant at the .05 level or better. The third set compares treatment
effect estimates with median family income. Five of the chi-squares were
significant, all of them in mathematics. The fourth set compares the
treatment effect estimates with percent white. Two of these chi-squares were
significant, both of them in reading.

The tables with significant chi-squares are presented in Tables 4a
through 4k in the appendix. All of these tables show a tendency for the
treatment effect estimates to be positively related to initial status, that
is, the groups with the largest estimated treatment effect favoring the
experimental group tended to be those groups with the largest initial status
difference favoring the experimental group. This positive correlation between

**The first graders took only a single pretest combining reading and
math readiness. Consequently their data was excluded from the "appropriate
pretest" chi-square tables. Otherwise all six grades are included in each
table.
initial status and estimated treatment effect means the estimates are biased.
Five of the six methods for estimating the treatment effects showed at
least one significant chi-square. The one exception, surprisingly enough, was
the unadjusted raw score mean gain difference, an estimate which we and
Garfinkel and Gramlich consider to be biased on a priori grounds, which
will be discussed later.

The six methods to estimate the treatment effect are described in
Table 1.

Why the Statistical Adjustments Failed 

The simplest explanation for the failure of the statistical
adjustments may be the most profound: the schools were simply different. They
were different on the achievement pretests, on parental income, on parental 
education, and on a variety of other known and unknown variables. There is 
simply no known statistical procedure that can be counted on to make the 
appropriate adjustment in such cases (Lord, 1967, 1969). In this particular 
case, we have demonstrated that the adjustments were not appropriate. All of 
the statistical adjustments rest on a series of assumptions and in this section
we will demonstrate that these assumptions were not met in the OEO study. 
The effect of failing to meet these assumptions was to systematically under- 
estimate the effect of performance contracting. 

All the adjustment procedures are essentially attempts to predict
what the difference between the treatment groups would have been in the
absence of the treatment. If the predicted difference and the obtained
difference are about the same, one concludes that there was no treatment
effect. On the other hand, if the predicted and the obtained differences
are sufficiently discrepant, one concludes that the discrepancy is due to
the effect of the treatment. In short, the adjustment procedures require
certain assumptions about the predictability of the academic growth of the
treatment groups (i.e., what would have happened if what did happen had
not happened).

The simplest assumption is that the experimental and control groups 
would have grown by the same number of raw score points if they had received 
the same treatment. Garfinkel and Gramlich present the mean gain difference 
on line 1 of Table III. For example, on the reading test the first grade 
experimental group gained one point more than the first grade control group 
did. If we could assume that the two groups would have gained the same amount 
in the absence of the experimental treatment (performance contracting), the
one point difference could be attributed to performance contracting. But the 
assumption of equal gains seems unlikely in light of the data indicating that 
the experimental group had lower pretest scores, family income, and SES than 
did students in the control group. There is sufficient evidence in the OEO 
report and press release, as well as in other research (Coleman, 1965; 
Hubert, 1972) that students who start out low on these dimensions have a 
slower rate of growth in skill development than do students who are more 
able and/or come from more affluent homes. It would be expected, therefore,
that the experimental group would gain less in test scores than the control
group if there were no effect of performance contracting. Thus, the mean
unadjusted gain differences are negatively biased estimates of the effect of
performance contracting, i.e., they underestimate its impact. It is
interesting to note, therefore, that since the unadjusted mean gains were essentially
the same for the two groups, there is actually some evidence to support the
contention that performance contracting had a generally positive impact on
scores.
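
The arithmetic behind this argument can be made explicit with a small sketch. The numbers and the function name `implied_effect` are hypothetical and serve only to show the direction of the bias; they are not estimates from the OEO data.

```python
def implied_effect(exp_gain, ctl_gain, expected_shortfall):
    """Treatment effect implied by the observed gains.

    expected_shortfall is the number of points by which the experimental
    group would have been expected to trail the control group's gain if
    there had been no treatment (positive for a lower-status group).
    """
    observed_gain_difference = exp_gain - ctl_gain
    return observed_gain_difference + expected_shortfall

# Hypothetical raw-score gains: both groups gain 10 points.
naive = implied_effect(exp_gain=10.0, ctl_gain=10.0, expected_shortfall=0.0)
# Same gains, but the experimental group was expected to gain 2 points less.
corrected = implied_effect(exp_gain=10.0, ctl_gain=10.0, expected_shortfall=2.0)

print(naive, corrected)
```

Under the equal-gains assumption the implied effect is zero, while any positive expected shortfall turns the same observed gains into a positive implied effect.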

Garfinkel and Gramlich also concluded that the mean gain differences
were negatively biased estimates of the experimental effect. On line 4 of
Table III they presented the "adjusted" mean gain differences. This adjustment
process assumes that the pretest-posttest relationship is the same between
groups and within groups. This assumption holds reasonably well for randomized
groups, but is not necessarily true for non-randomized groups. In fact, the
between-group and within-group relationships may even have different signs
(Robinson, 1950). The assumption that the between-group and within-group
relationships would have been identical in the absence of the treatment effect
is known as the "ecological fallacy." This problem has usually been discussed
in terms of the difficulty of inferring individual behavior from the behavior
of group averages, but the general principle is the same: variables do not
necessarily affect total group means in the same way that they affect
subgroups or individuals within this total group. For a fuller treatment of this
problem, see Selvin (1958), Cartwright (1969), and Hannan (1971).
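
That the within-group and between-group relationships can differ in sign is easy to show with a constructed example. The scores below are invented for two hypothetical schools, and the helper names `mean` and `slope` are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical pretest and posttest scores for students in two schools.
school_a = {"pre": [1, 2, 3], "post": [11, 12, 13]}
school_b = {"pre": [6, 7, 8], "post": [4, 5, 6]}

# Within each school, posttest rises one point per pretest point.
within_a = slope(school_a["pre"], school_a["post"])
within_b = slope(school_b["pre"], school_b["post"])

# Between schools, the slope is computed from the two school means.
between = slope(
    [mean(school_a["pre"]), mean(school_b["pre"])],
    [mean(school_a["post"]), mean(school_b["post"])],
)

print(within_a, within_b, between)
```

In this constructed case both within-school slopes are positive while the between-school slope is negative, so an adjustment that applies the within-school slope to the difference between school means would be adjusting in the wrong direction.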

Garfinkel and Gramlich presented three regression estimates of the
treatment effect on lines 2, 3, and 5 of Table III. These estimates depend
on the assumption that the between-groups and within-group regression slopes
would have been identical in the absence of any treatment effect. This is
another example of the ecological fallacy. O'Connor (1972) demonstrates that
estimates of school treatment effects can be biased when they are based on
the within-group regression slopes.

Even if we were to ignore the ecological fallacy (and assume that
the within-group and between-groups regression slopes would have been
identical), the Garfinkel and Gramlich analysis would still be invalid for
the following three reasons:

1. Specification error

2. Errors of measurement

3. Heterogeneity of the within-group regression slopes.

The first and third reasons apply to all five of the estimates by
Garfinkel and Gramlich and to the Battelle estimate, and the second reason
applies to the second and third estimates presented by Garfinkel and Gramlich
and to the Battelle estimate.

"Specification error" occurs when the treatment groups are different
on one or more variables which are correlated with achievement and which are
not used in adjustment equations. For example, we know that parental income
is related to achievement and that the experimental and control groups were
different on average parental income. If parental income, initial achievement
(the pretest scores), and the treatment (performance contracting) were the
only variables that affected final achievement, the correct mathematical model
would be

    Final achievement = initial achievement effect + parental income
                        effect + the treatment effect.

In such a case, the following model would be an oversimplification of reality
and would produce a biased estimate of the treatment effect:

    Final achievement = initial achievement effect + the estimated
                        treatment effect.

The estimated treatment effect in the second equation would equal the actual
treatment effect plus part of the parental income effect. In other words,
the second equation gives the treatment credit for part of the parental
income effect. The bias resulting from the failure to include (specify)
parental income in the equation is one form of "specification error."
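
The two equations can be compared numerically. The following sketch fits both models to a small, error-free, entirely hypothetical data set in which the true treatment effect is fixed at 3 points. The coefficients, the student records, and the helper `ols` (a plain normal-equations solver) are assumptions made for illustration and are not taken from the OEO report.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

TRUE_EFFECT = 3.0  # hypothetical treatment effect, in test-score points

# (pretest, parental income in arbitrary units, treated?) for hypothetical students.
# The experimental group starts lower on both pretest and income.
students = [(10, 1, 1), (12, 2, 1), (14, 1, 1), (16, 2, 1),
            (14, 3, 0), (16, 4, 0), (18, 3, 0), (20, 4, 0)]
final = [1.0 * pre + 2.0 * income + TRUE_EFFECT * treated
         for pre, income, treated in students]

# Correct model: intercept, pretest, income, treatment.
full = ols([[1, pre, income, treated] for pre, income, treated in students], final)
# Misspecified model: income is left out.
short = ols([[1, pre, treated] for pre, income, treated in students], final)

print("treatment effect, income included:", round(full[3], 3))
print("treatment effect, income omitted: ", round(short[2], 3))
```

With income included the true effect is recovered. With income omitted, the part of the income gap that the pretest does not account for is charged against the treatment, and because the control group has the higher income the estimate falls below the true effect.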

Garfinkel and Gramlich's first, second, fourth and fifth estimates
are based on equations which include only the initial achievement scores,
not SES, parental income, parental education, or other variables on which
the experimental and control groups differ. Since the control group was
initially higher on these variables, the probable result of the specification
error was to underestimate the effect of performance contracting.

The estimates on line 3 were based on an equation which included 
additional variables such as family income, parental education, race, sex, and 
age. However, the equation did not correct for errors of measurement and did 
not include all of the variables which might be related to achievement (e.g., 
Coleman Report, 1966). Further, the data were too incomplete (55% response 
rate for family income, p. A) to place a high reliance on the results (parents 
who return questionnaires are likely to be different than those who do not). 

The estimates on lines 2 and 3 were based on equations which did not
correct for errors of measurement in the pretest data (initial achievement
scores, parental income, etc.). (Note: No correction for errors of
measurement was required for the estimates on line 1. However, these estimates are
likely to be biased by specification error.) Since the control group had
higher initial scores, the probable result of failing to make this correction
was to underestimate the effect of performance contracting.
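
A minimal numerical sketch shows why an uncorrected pretest leads to under-adjustment. It assumes the classical result that random error in the pretest shrinks the observed regression slope to the reliability times the true-score slope, and it assumes for illustration that group differences follow the true-score slope. The reliability, slopes, and group gaps are hypothetical.

```python
def covariance_adjusted_effect(post_gap, pre_gap, slope):
    """Experimental-minus-control posttest gap, adjusted for the pretest gap."""
    return post_gap - slope * pre_gap

TRUE_SLOPE = 1.0      # posttest points per true pretest point (assumed)
RELIABILITY = 0.8     # hypothetical pretest reliability
OBSERVED_SLOPE = RELIABILITY * TRUE_SLOPE  # attenuated by measurement error

PRE_GAP = -5.0                    # experimental group starts 5 points lower
POST_GAP = TRUE_SLOPE * PRE_GAP   # posttest gap when the treatment has no effect

with_true_slope = covariance_adjusted_effect(POST_GAP, PRE_GAP, TRUE_SLOPE)
with_observed_slope = covariance_adjusted_effect(POST_GAP, PRE_GAP, OBSERVED_SLOPE)

print(with_true_slope, with_observed_slope)
```

With the true-score slope the adjusted effect is zero, as it should be when there is no treatment effect. With the attenuated slope only four of the five points are removed, leaving a spurious effect of minus one point against the group that started lower.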

The estimates on lines 4 and 5 are corrected for errors of measurement
in the pretest achievement scores but they are not adjusted for the effect of
the variables such as parental income and education which are known to
influence achievement. Although the estimates are labeled differently, both
are essentially regression coefficients adjusted for errors of measurement according
to a similar set of assumptions. The probable result of the specification
errors in these estimates is to again underestimate the effect of performance
contracting.

Another way to look at estimates which are "adjusted" for the effect
of errors of measurement is to realise that the correction, at best, produces 
the same estimate that one would have obtained with a perfectly reliable 
measure of initial status. Lord (1967) demonstrated that even with a perfectly 
reliable measure of initial status, the estimated treatment effects are not a 
reliable indicator of the actual treatment effect when the groups are different 
on the pretest measures. 

All of the estimates presented by Garfinkel and Gramlich are based 
on the further assumption that the experimental and control groups have the 
same within-group regression slopes. Table V (G & G) shows that experimental 
and control groups have significantly different regression slopes for at least 
first grade reading and mathematics, and for 7th and 8th grade mathematics 
(i.e., 4 of the 12 grade subject combinations). In addition, the true score
regression slopes presented in Table II appear to be substantially different 
for 2nd grade reading and 3rd grade reading and mathematics, although no test 
of statistical significance was performed. Without a re-analysis of the data, 
it is not possible to state whether these differences in the regression slopes 
biased the Garfinkel and Gramlich estimates of the treatment effect. Because 
of the other inadequacies of the experimental design, it is probably not 
worthwhile to pursue this re-analysis. 

The differences in the regression slopes have practical implications
as well, despite Garfinkel and Gramlich's statement "that the differences in
slope between experimental and control students is very slight, being
statistically significant in only a few cases and never amounting to much
quantitatively" (p. 20). In the four cases where the slopes are significantly
different, the experimental group has the smaller regression slope. This
indicates that the performance contractors were relatively more successful
with low initial achievement students than were the public schools. There
are two alternative explanations for this finding: (1) the performance
contractors were successfully concentrating their efforts on the low initial
achievement students, their designated target, or (2) the difference in the
slopes is simply another indication of how really different the experimental
and control groups were initially in terms of their patterns of academic
growth. Because the schools were not randomly assigned to experimental and
control groups, there is no way to definitely choose between these two
alternative explanations.

The Battelle estimates, R6 and M6, are perhaps the most interesting
because they attempted to take into consideration possible differences in the
within-group regression slopes. Figures 1, 2, and 3 illustrate this approach.
The Battelle approach compares the two regression slopes at the mean of the
combined groups. In Figure 1, the two regression slopes are identical
and hence the estimated treatment effect is zero at the combined group mean
and at every other point.

Figure 2 presents the same means but this time the experimental 
group has a steeper regression slope than the control group. Under these 
circumstances, the Battelle estimate of the treatment effect is positive.
Figure 3 presents the same means but in this example, the experimental group
has a flatter regression slope and the Battelle estimate of the treatment
effect is negative. Note that in all three cases, the estimated treatment
effect at the experimental group pretest mean is zero. In other words, the
experimental treatment effect was not estimated at the pretest mean of the
group that the performance contractors actually had to work with, but at the
pretest mean of some hypothetical combined group.
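
The three cases can be reproduced numerically. The sketch below is a reconstruction from the verbal description of the figures only: it assumes two straight regression lines that cross at the experimental group's pretest mean, equal group sizes, and invented means and slopes. The function name `gap_between_lines` is hypothetical.

```python
def gap_between_lines(x, crossing_x, exp_slope, ctl_slope):
    """Experimental-minus-control predicted posttest at pretest score x,
    for two regression lines that cross at crossing_x."""
    return (exp_slope - ctl_slope) * (x - crossing_x)

# Hypothetical pretest means; the control group starts higher.
EXP_PRETEST_MEAN = 20.0
CTL_PRETEST_MEAN = 30.0
COMBINED_MEAN = (EXP_PRETEST_MEAN + CTL_PRETEST_MEAN) / 2  # assumes equal group sizes
CTL_SLOPE = 1.0

results = {}
for label, exp_slope in [("identical", 1.0), ("steeper", 1.2), ("flatter", 0.8)]:
    at_combined = gap_between_lines(COMBINED_MEAN, EXP_PRETEST_MEAN, exp_slope, CTL_SLOPE)
    at_exp_mean = gap_between_lines(EXP_PRETEST_MEAN, EXP_PRETEST_MEAN, exp_slope, CTL_SLOPE)
    results[label] = (at_combined, at_exp_mean)
    print(label, round(at_combined, 2), round(at_exp_mean, 2))
```

Evaluated at the combined mean, the steeper experimental slope yields a positive estimate and the flatter slope a negative one, while the estimate at the experimental group's own pretest mean is zero in every case.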

This appears to be an unreasonable way to evaluate any program. 
Furthermore, if performance contracting is to be used to upgrade the 
performance of the lowest achieving students, it should concentrate on the 
lowest achieving students with the result that the regression slope is 
relatively flat. In contrast, the Battelle approach assigns a positive 
treatment effect to the performance contractors with a steep slope and a 
negative treatment effect to the performance contractor with the flatter 
slope, exactly the opposite of what is socially desirable.

Intercorrelations of the Various Estimates

We can hypothesize a true model which together with error-free 
data would give us the true treatment effect for each site. All of the 
estimates derived from the six methods discussed earlier will deviate to some 
extent from the true estimates because of specification errors, sampling 
errors, and measurement errors. By examining the intercorrelations of the
various estimates, we can get a very rough indication of how well these
estimates might be correlated with the true effects. This is somewhat
analogous to parallel-forms reliability coefficients.

Tables 3a and 3b present these intercorrelations separately for
each grade and the mean over all six grades. The correlations range from
.99 to -.01 for reading and from .995 to .21 for mathematics. Any correlation
below .75 can be considered significantly below .90, a minimum acceptable
level of reliability for the standardized achievement tests (p = .05,
one-tailed test). Slightly over half of the correlations and half of the means
are below .75. In short, the estimates are poorly correlated and cannot form
the basis for a valid determination of the true effects unless a clear case
can be made that one of the estimates is more highly correlated with the
true effect than the others are. To our knowledge, no such case can be made
with these data. It is worth noting that although we have been discussing the
estimates of treatment effects at the individual sites, the same arguments
hold for the estimates of the treatment effects over all sites. Since the
errors we have discussed are systematic errors, they affect the overall
estimates as well as the site estimates. Consequently, no greater confidence
can be placed in the overall estimates than in site estimates.
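
The .75 cutoff can be checked with a standard large-sample test. The sketch below uses the Fisher z transformation and assumes n = 18 paired sites; the procedure actually used is not stated in the text, the number of paired sites varies from table to table, and the function name `z_below` is hypothetical.

```python
import math

def z_below(r, rho0, n):
    """One-tailed z statistic for H0: rho = rho0 against H1: rho < rho0,
    using the Fisher z transformation with standard error 1 / sqrt(n - 3)."""
    return (math.atanh(rho0) - math.atanh(r)) * math.sqrt(n - 3)

CRITICAL_Z_05_ONE_TAILED = 1.645
N_SITES = 18  # assumed number of paired sites

for r in (0.80, 0.75):
    z = z_below(r, rho0=0.90, n=N_SITES)
    print(r, round(z, 2), z > CRITICAL_Z_05_ONE_TAILED)
```

Under these assumptions a correlation of .75 is significantly below .90 at the .05 level while a correlation of .80 is not, which is in line with the cutoff used above.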

For the purpose of comparison we have presented in Table 3c the 
correlations of the Reading estimates with the Mathematics estimates holding 
the method constant. These range from .89 to -.04. Unfortunately we can not 
determine from this data whether the positive correlations are due primarily 
to the contractors having equal success with both subjects or to shared biases 
in the procedures. 



Summary and Conclusions

Even if the schools had been randomly assigned to the experimental
and control groups and appropriate statistical methods had been devised, it
is likely that the inadequacies in the assignment of students and the data
collection would have rendered the data uninterpretable. Most of these
inadequacies can be attributed to OEO's unwillingness to allow sufficient time
for the proper planning of this experiment.

There has been a consistent theme across many federal evaluation
efforts for the experimental groups to contain students who are lower scoring
and less affluent than the control students. The experimental programs are
assigned to the schools and the students who appear to have the greatest need
for the program. The evaluators then look for "comparable" control schools.
Unfortunately, this method of assignment ensures that the comparison schools
and students are not comparable in a variety of known and unknown ways.
Consequently, the data must be "adjusted" statistically in an attempt to
control for the non-comparability. It is clear from the OEO data and from
a number of theoretical articles that these adjustments cannot compensate
for deficiencies in the experimental design. Perhaps it can be argued that
in the case of large scale national programs such as Title I, there is no
politically feasible alternative way to conduct the evaluation. Whatever
the merits of that argument, it seems clear that in small pilot programs
such as the OEO experiment, it is feasible to randomly assign schools to
the experimental and control conditions.

We are also troubled by the persistence of the federal evaluators
in making arbitrary summative evaluations about highly diverse programs. Terms
like performance contracting, Head Start, Title I, and compensatory reading
do not define experimental treatments. Undoubtedly, there are some performance
contractors who could out-perform some schools under some circumstances. The
question was which contractors could out-perform schools under what
conditions? The OEO experiment in performance contracting was neither
designed nor analysed to adequately answer that question.



Table 1

A description of the six methods used by OEO and by Battelle to Estimate the
Treatment Effects.

R1 and M1: simply the difference between the mean gain of the experimental group
and the mean gain of the control group.
R2 and M2: a regression model using same subject pretest as the only covariate
and allowing for non-linearities.
R3 and M3: a regression model using same subject pretest and "a vector of other
independent variables including average family income, education of
parents, race, sex, and age" as covariates and allowing for
non-linearities in the relationship with pretest.
R4 and M4: R1 and M1 adjusted for "bias."
R5 and M5: R2 and M2 adjusted for "bias."
R6 and M6: the Battelle estimate derived from a regression model which used
the same subject pretest as the only covariate, allowing for
differences in the within-group regression slopes. This estimate is
discussed in more detail in the text.
Table 3a: Intercorrelations of the Six Estimates of the Treatment Effects for
Reading. The means are listed below each column and the number of
paired sites is given in parentheses.
Table 3b: Intercorrelations of Six Estimates of the Treatment Effects for Mathematics.
The means are listed below each column and the number of paired sites
is given in parentheses.
Table 3c: Correlations of the Estimates of the Reading Treatment Effects with
the Estimates of the Mathematics Treatment Effects using the same
method of analysis.